
feat: verda cloud (gpu/ai) provider support#142

Merged
rafeegnash merged 23 commits into master from feat/verda-provider
Apr 20, 2026

Conversation

@rafeegnash
Collaborator

Summary

  • Adds Verda Cloud (ex-DataCrunch) as a first-class clanker provider, mirroring the shape of the existing cf/do/hetzner/vercel integrations.
  • New clanker verda command tree (list/get/action/balance + verda ask) plus clanker ask --verda with keyword routing.
  • Adds verda-instant Kubernetes cluster provider so clanker k8s can provision and pull kubeconfigs from Verda Instant Clusters.
  • Exposes clanker_verda_ask / clanker_verda_list over MCP.

Details

  • OAuth2 Client Credentials flow against https://api.verda.com/v1, with expires_in-driven refresh, 429 / Retry-After, 207 multi-status and {code,message} error decoding.
  • Credential resolution order matches other providers: ~/.clanker.yaml → VERDA_* env → ~/.verda/credentials (written by verda auth login).
  • Pulls kubeconfig off an Instant Cluster's head node via SSH and rewrites the server: URL to the public IP.
  • Unit tests cover token caching, 429 retry, multi-status decode, and credentials-file parsing (YAML + flat).

Test plan

  • make fmt vet test-short build
  • ./bin/clanker verda --help shows the new tree
  • ./bin/clanker ask --help lists --verda
  • ./bin/clanker verda list instances against a real Verda account
  • ./bin/clanker verda ask "what's my balance and running GPUs?" against a real account
  • ./bin/clanker mcp --transport http --listen :39393 exposes clanker_verda_*

nash added 23 commits April 20, 2026 01:40
- release mutex during oauth token fetch to avoid deadlock risk
- honor context cancellation during retry backoffs
- proactively bound polling sleep to remaining deadline
- drop `offline` from terminal instance status so transient stops don't end polling early
- atomic rename on conversation history save so a crash mid-write can't corrupt it
- pass remote path to ssh as a separate argv token and reject shell metacharacters so a dynamic candidate list cannot inject commands
- add BatchMode=yes to ssh so kubeconfig reads fail fast on missing keys instead of prompting for a password
- lowercase uuid inputs before short-circuiting hostname lookup so pasted uppercase ids resolve correctly
- resolve hostname to instance id in verda action so users can pass names instead of uuids
- add container-deployments, job-deployments, container-types, secrets, file-secrets, registry-creds, and balance to verda list
- gather container and job deployment context for ask-mode prompts
- propagate cmd.Context to handleVerdaQuery so ctrl-c cancels in-flight api calls
- check json.Marshal error on action payload instead of discarding it
- uuid parsing, scp target splitting, kubeconfig server rewrite, and the nil-client guard now have coverage in internal/k8s/cluster
- ResolveInstanceID verifies hostname lookup, uuid short-circuit, uppercase normalisation, and unknown-name error path
- sleepCtx exercises both cancellation and zero-duration paths
- routing tests confirm verda keyword hits, datacrunch alias, default-provider fallback, no-false-positive on bare 'gpu', and LLM classification clearing the right providers
- tighten looksLikeUUID to lowercase-only since resolveClusterID already lowercases, matching the verda package mirror
cloudflare, k8s, gcp, azure, digitalocean, hetzner, aws, and iam branches all
leaked ctx.Verda through when the llm picked a different provider. a
keyword-inferred verda signal paired with an llm-picked aws query would
surface both flags and downstream code would run both paths.

add a parametric regression test so every sibling provider gets checked.
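The invariant the regression test checks can be illustrated with a small sketch: once the llm picks a provider, no other provider flag stays set. The map-based `selectOnly` helper is hypothetical; the real code operates on fields such as ctx.Verda.

```go
package main

import (
	"fmt"
	"sort"
)

// selectOnly returns a copy of flags in which only picked remains set.
// Hypothetical helper for illustration.
func selectOnly(flags map[string]bool, picked string) map[string]bool {
	out := make(map[string]bool, len(flags)+1)
	for name := range flags {
		out[name] = false
	}
	out[picked] = true
	return out
}

func main() {
	// verda was inferred from a keyword, but the llm picked aws.
	inferred := map[string]bool{"aws": false, "hetzner": false, "verda": true}
	final := selectOnly(inferred, "aws")

	names := make([]string, 0, len(final))
	for n := range final {
		names = append(names, n)
	}
	sort.Strings(names)
	for _, n := range names {
		fmt.Println(n, final[n])
	}
}
```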

- mcp `clanker_verda_list` switch gains containers, jobs, secrets, file-secrets, and registry-creds, matching the cli surface added in the previous commit
- error on unknown resource now lists every supported value so users don't need to open docs
- `clanker verda balance` prints a human-readable `Balance: $42.17 USD` line in addition to the json body, skipped with --raw for scriptability
- mcp input schema description updated so agents that read the schema know the full enum

pickKubernetesClusterImage used to fall back silently to the default cluster
image when no image carried a kubernetes/k8s hint. the cluster would then
provision without k8s installed, and GetKubeconfig would later fail opaquely
looking for /root/.kube/config. return the label-match as a second boolean
so Create can emit a visible warning with the image id in the log, giving
the caller a chance to abort or inject a startup script.
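The "second boolean" shape can be sketched like this. The `image` type, the matching rule, and the function name `pickClusterImage` are assumptions made for illustration; only the idea of returning the label-match flag comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// image is a hypothetical stand-in for the provider's image record.
type image struct {
	ID   string
	Name string
}

// pickClusterImage returns the chosen image id and whether it matched a
// kubernetes/k8s hint. When nothing matches it returns defaultID and false
// so the caller can warn instead of silently provisioning without k8s.
func pickClusterImage(images []image, defaultID string) (id string, matched bool) {
	for _, img := range images {
		name := strings.ToLower(img.Name)
		if strings.Contains(name, "kubernetes") || strings.Contains(name, "k8s") {
			return img.ID, true
		}
	}
	return defaultID, false
}

func main() {
	imgs := []image{
		{ID: "img-1", Name: "Ubuntu 24.04"},
		{ID: "img-2", Name: "Ubuntu 24.04 + K8s"},
	}
	id, ok := pickClusterImage(imgs, "img-default")
	fmt.Println(id, ok)

	id, ok = pickClusterImage(imgs[:1], "img-default")
	if !ok {
		fmt.Printf("warning: no kubernetes-labelled image found, falling back to %s\n", id)
	}
}
```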

previously any path that reached RunVerdaCLI* surfaced a generic exec
failure like "exec: not found". with this change:

- new typed sentinel ErrCLINotInstalled + IsCLINotInstalled helper so
  callers can branch without string matching. the error message now points
  at the docs + brew install + REST fallback rather than a bare "not found"
- CLIInstalled() preflight lets callers show cli-aware ux eagerly instead
  of waiting for a real exec error
- resolveVerdaCredentials suggests `verda auth login` only when the binary
  is actually installed; otherwise it drops that line so the user isn't
  steered toward a command they can't run
- k8s agent auto-registration now logs a specific reason when the provider
  is skipped (no creds / partial creds / NewClient failed) at debug level
  so users investigating "verda-instant missing from cluster types" get a
  breadcrumb

tests cover ErrCLINotInstalled returning from a stubbed empty PATH, the
wrapped error message mentioning docs.verda.com and the REST fallback,
and CLIInstalled returning false on an empty PATH.
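A typed sentinel plus helper, as described above, usually follows the standard `errors.Is` pattern. The sketch below assumes a `lookCLI` wrapper around `exec.LookPath` and an illustrative error message; neither is taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// ErrCLINotInstalled is the sentinel callers can branch on.
var ErrCLINotInstalled = errors.New("verda cli not installed")

// IsCLINotInstalled reports whether err wraps ErrCLINotInstalled.
func IsCLINotInstalled(err error) bool {
	return errors.Is(err, ErrCLINotInstalled)
}

// lookCLI is a hypothetical preflight: it wraps a PATH lookup failure in the
// sentinel so callers never have to match on "not found" strings.
func lookCLI(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("%w: %q is not on PATH (install the CLI or use the REST fallback)",
			ErrCLINotInstalled, name)
	}
	return path, nil
}

func main() {
	_, err := lookCLI("definitely-not-a-real-binary-7f3a")
	fmt.Println(IsCLINotInstalled(err))
	fmt.Println(err)
}
```

Wrapping with `%w` keeps the human-readable hint in the message while `errors.Is` still finds the sentinel.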

two concurrent `clanker ask --verda` invocations for the same project could
race on the tmp-file + rename, dropping one conversation's history without
returning an error. add a package-level map of per-scope sync.Mutex so both
Save and Load take the same lock before touching the file — write barrier
first, then the existing struct RWMutex for in-memory state.

tests cover the save/load round trip, a 20-goroutine storm against the same
scope that asserts the final file parses as valid json, and a leak check
that confirms only the final file remains (no stray verda_*.tmp). go test
-race is clean.

matches the pattern resolveHetznerToken / resolveVercelToken already use.
when local resolution (config → env → ~/.verda/credentials) comes up empty
and a clanker backend api key is configured, we now try the backend
credential store. a 404 from the backend (because the server-side route
may not be provisioned yet) gracefully falls through to the existing
human-readable "not configured" error.

- new VerdaCredentials type + ProviderVerda constant in internal/backend/types.go
- GetVerdaCredentials + StoreVerdaCredentials client methods following the
  existing Vercel/Hetzner shape so `clanker credentials store verda ...`
  can be added server-side later without client updates
- resolveVerdaCredentialsWithContext keeps the sync-safe resolveVerdaCredentials
  wrapper for non-ctx callers, but handleVerdaQuery now uses the ctx variant
  so cancellation during the backend round-trip is honoured
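The fallback order (config, env, ~/.verda/credentials, then backend) amounts to a first-match chain. A minimal sketch follows, with hypothetical `creds`, `source`, and `resolve` names and stubbed lookups in place of real file, env, or network access:

```go
package main

import (
	"errors"
	"fmt"
)

type creds struct{ ClientID, ClientSecret string }

// source is a hypothetical lookup step; load reports ok=false when the
// step has nothing to offer (for example a 404 from the backend).
type source struct {
	name string
	load func() (creds, bool)
}

var errNotConfigured = errors.New("verda credentials not configured")

// resolve returns the first complete credential pair and the name of the
// source that supplied it.
func resolve(chain []source) (creds, string, error) {
	for _, s := range chain {
		if c, ok := s.load(); ok && c.ClientID != "" && c.ClientSecret != "" {
			return c, s.name, nil
		}
	}
	return creds{}, "", errNotConfigured
}

func main() {
	empty := func() (creds, bool) { return creds{}, false }
	chain := []source{
		{name: "config", load: empty},
		{name: "env", load: empty},
		{name: "credentials-file", load: empty},
		{name: "backend", load: func() (creds, bool) {
			return creds{ClientID: "id", ClientSecret: "secret"}, true
		}},
	}
	c, from, err := resolve(chain)
	fmt.Println(c.ClientID, from, err)
}
```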

the ask-mode backend fallback we just added reads from the clanker backend
credential store, but users had no way to push verda creds to it locally.
this closes that loop.

- storeVerdaCredentials reads client-id/secret/project-id from --flag, then
  verda.Resolve* chain (viper → env → ~/.verda/credentials) and uploads to
  PUT /api/v1/cli/credentials/verda via the new backend client method
- --client-id / --client-secret / --project-id flags added to the store
  command with the other provider flags
- help text and usage examples updated to include verda
- friendly error when both credential fields are missing points at every
  configuration path (flags, yaml, env, `verda auth login`)

`clanker credentials test verda` pulls the stored client_id/secret from the
clanker backend, spins up a verda.Client, and hits /v1/balance as the
cheapest authenticated probe. debug mode prints the returned balance.

`clanker credentials delete verda` accepted "verda" as a provider string
via the store path but never reached the delete logic. runCredentialsDelete
now routes ProviderVerda.

VerdaPlanPromptWithMode instructs the llm to emit a rest-first plan:
args are [verda-api, METHOD, /v1/path, body?] instead of a shell command.
this keeps maker execution independent of the verda cli (which may not be
installed) and reuses the well-understood verda oauth2 + retry client.

prompt covers the verda api's most common flows: list, create instance,
lifecycle actions (start/shutdown/delete/hibernate), create volume,
attach volume, create ssh key, create startup script, create instant
cluster with kubernetes image, discontinue cluster, check balance, and
enumerate instance-types. documents the binding format so the planner
can chain commands (INSTANCE_ID -> next command).

ExecuteVerdaPlan dispatches [verda-api, METHOD, /v1/path, body?] commands
directly through verda.Client — no shell-out, no cli dependency. the
existing oauth2 token caching, 429 backoff, and typed error decoding all
apply unchanged. plan bindings (<PLACEHOLDER>) and `produces` jsonpath
capture are wired through applyPlanBindings + learnPlanBindingsFromProduces
so multi-step plans (create instance -> start instance) compose cleanly.
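Placeholder substitution between plan steps could be sketched as below. `applyBindings` is a hypothetical simplification of applyPlanBindings, and the path in the example is illustrative rather than a documented endpoint.

```go
package main

import (
	"fmt"
	"strings"
)

// applyBindings replaces every <NAME> placeholder that has a known value
// and leaves unknown placeholders untouched.
func applyBindings(args []string, bindings map[string]string) []string {
	out := make([]string, len(args))
	for i, a := range args {
		for name, val := range bindings {
			a = strings.ReplaceAll(a, "<"+name+">", val)
		}
		out[i] = a
	}
	return out
}

func main() {
	// Pretend an earlier step produced INSTANCE_ID.
	bindings := map[string]string{"INSTANCE_ID": "abc-123"}
	next := []string{"verda-api", "GET", "/v1/instances/<INSTANCE_ID>"}
	fmt.Println(applyBindings(next, bindings))
}
```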

validateVerdaCommand enforces:
- exactly 3-4 args [verda-api, METHOD, path, body?]
- method in GET/POST/PUT/PATCH/DELETE
- path prefix /v1/
- no newlines in any arg
- destructive operations (DELETE, action=delete|discontinue|force_shutdown|
  delete_stuck|hibernate) gated behind --destroyer

ExecOptions gains VerdaClientID/VerdaClientSecret/VerdaProjectID fields.
internal/verda exposes SetBaseURLForTest so the maker's test file can
redirect the executor at an httptest server without touching production.

tests cover shape validation, destructive gating (delete/discontinue pass
only with --destroyer; start passes always), and end-to-end execution with
a real httptest server — verifying oauth2 cache reuse (one token call
across two api calls) and jsonpath binding substitution from one command
into the next.
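The validation rules listed above can be approximated by the sketch below. It assumes the action, when present, is a top-level `action` field in a JSON body; the name `validateCommand`, the error wording, and the example path are illustrative and not taken from the PR.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

var allowedMethods = map[string]bool{
	"GET": true, "POST": true, "PUT": true, "PATCH": true, "DELETE": true,
}

var destructiveActions = map[string]bool{
	"delete": true, "discontinue": true, "force_shutdown": true,
	"delete_stuck": true, "hibernate": true,
}

// validateCommand checks the [verda-api, METHOD, /v1/path, body?] shape and
// gates destructive operations behind the destroyer flag.
func validateCommand(args []string, destroyer bool) error {
	if len(args) < 3 || len(args) > 4 || args[0] != "verda-api" {
		return errors.New("expected [verda-api, METHOD, /v1/path, body?]")
	}
	for _, a := range args {
		if strings.ContainsAny(a, "\r\n") {
			return errors.New("newline in argument")
		}
	}
	method, path := args[1], args[2]
	if !allowedMethods[method] {
		return fmt.Errorf("unsupported method %q", method)
	}
	if !strings.HasPrefix(path, "/v1/") {
		return fmt.Errorf("path %q must start with /v1/", path)
	}
	destructive := method == "DELETE"
	if len(args) == 4 {
		// Assumption: the action is a top-level string field of the body.
		var body struct {
			Action string `json:"action"`
		}
		if err := json.Unmarshal([]byte(args[3]), &body); err == nil && destructiveActions[body.Action] {
			destructive = true
		}
	}
	if destructive && !destroyer {
		return errors.New("destructive operation requires --destroyer")
	}
	return nil
}

func main() {
	start := []string{"verda-api", "PUT", "/v1/instances", `{"action":"start"}`}
	del := []string{"verda-api", "PUT", "/v1/instances", `{"action":"delete"}`}
	fmt.Println(validateCommand(start, false)) // allowed
	fmt.Println(validateCommand(del, false))   // rejected without the gate
	fmt.Println(validateCommand(del, true))    // allowed with the gate
}
```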

ask.go now routes --maker --verda through the full plan + apply cycle:

- drop the hard-coded "not yet supported for verda" guard
- explicitVerda selects makerProvider="verda" (explicit reason)
- svcCtx.Verda inference selects makerProvider="verda" (inferred reason)
- maker prompt switch calls maker.VerdaPlanPromptWithMode
- verda is included in the read-only "output only" provider list so
  --maker (without --apply) prints the plan without trying to run it
  through the aws enrichment pipeline
- --apply path: resolve verda credentials (backend-fallback-enabled) and
  call maker.ExecuteVerdaPlan with Client{ID,Secret} + ProjectID threaded
  through ExecOptions

end-to-end: `clanker ask --maker --verda "create one h100 in FIN-01"`
emits a json plan; `clanker ask --maker --verda --apply < plan.json` runs
the plan's verda-api commands in order, respecting `produces` bindings
and the --destroyer gate for delete/discontinue/force_shutdown actions.
@rafeegnash rafeegnash merged commit ee0e660 into master Apr 20, 2026
5 checks passed
@rafeegnash rafeegnash deleted the feat/verda-provider branch April 20, 2026 10:18